lemma 18
A Reduction to no Memory Proofs
We first need the following lemma, which bounds the prediction shifts and magnitudes of Algorithm 2. See proof in Appendix A.2. We are now ready to prove Theorem 9. Proof of Theorem 9. We show that Algorithm 2 achieves the desired regret bound. Lipschitz) where the last transition used the Lipschitz assumption to bound the gradient. This concludes the second part of the lemma. We give a general example of a BCO algorithm that may be employed in conjunction with our reduction procedure given in Algorithm 2. For a positive semi-definite matrix Moreover, for all null null we have that 1. if null The proof of Lemma 15 relies on a few standard results.
Our analysis is significantly more complicated compared to
We thank the reviewers for their careful consideration and their feedback; our replies are provided below. We will add a conclusion section to summarize our paper. NLD is indeed a bit unfortunate but the name "non-reversible" for such dynamics is We will define it earlier than Line 114. We thank the reviewer for the insightful comments. We sincerely apologize for mis-citing Lemma EC.6 in [GGZ18].
Supplementary Materials A Hessian Vector Implementation
We then select those that yield the best convergence performance. However, our code supports GPU cluster training. VRBO becomes slower and less stable. As a result, single-sample based algorithms enable a larger parameter update per sample, and hence achieve a higher sample efficiency. Besides, we apply the standard grid search for the inner- and outer-loop stepsizes for all algorithms.
Minimum width for universal approximation using squashable activation functions
Shin, Jonghyun, Kim, Namjun, Hwang, Geonho, Park, Sejun
The exact minimum width that allows for universal approximation of unbounded-depth networks is known only for ReLU and its variants. In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable functions that can approximate the identity function and binary step function by alternately composing with affine transformations. We show that for networks using a squashable activation function to universally approximate $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$, the minimum width is $\max\{d_x,d_y,2\}$ unless $d_x=d_y=1$; the same bound holds for $d_x=d_y=1$ if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum width result holds for those general classes of activation functions.
Stability of sorting based embeddings
Balan, Radu, Tsoukanis, Efstratios, Wellershoff, Matthias
Consider a group $G$ of order $M$ acting unitarily on a real inner product space $V$. We show that the sorting based embedding obtained by applying a general linear map $\alpha : \mathbb{R}^{M \times N} \to \mathbb{R}^D$ to the invariant map $\beta_\Phi : V \to \mathbb{R}^{M \times N}$ given by sorting the coorbits $(\langle v, g \phi_i \rangle_V)_{g \in G}$, where $(\phi_i)_{i=1}^N \in V$, satisfies a bi-Lipschitz condition if and only if it separates orbits. Additionally, we note that any invariant Lipschitz continuous map (into a Hilbert space) factors through the sorting based embedding, and that any invariant continuous map (into a locally convex space) factors through the sorting based embedding as well.
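The construction in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that $G$ is the cyclic group acting on $\mathbb{R}^n$ by coordinate shifts (so $M = n$), and that the linear map $\alpha$ is given as a list of rows acting on the flattened sorted coorbits. The names `cyclic_shift`, `sorted_coorbit`, and `sorting_embedding` are hypothetical and do not come from the paper. The sketch only demonstrates invariance under the group action; it does not check orbit separation or the bi-Lipschitz property.

```python
def cyclic_shift(x, s):
    """Cyclically shift the list x by s positions (the assumed group action)."""
    s %= len(x)
    return list(x) if s == 0 else list(x[-s:]) + list(x[:-s])


def inner(u, v):
    """Standard inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sorted_coorbit(v, template):
    """Sort the coorbit (<v, g*template>) over all cyclic shifts g."""
    n = len(v)
    return sorted(inner(v, cyclic_shift(template, s)) for s in range(n))


def sorting_embedding(v, templates, alpha):
    """Apply the linear map alpha (a list of rows) to the flattened sorted coorbits."""
    beta = [c for t in templates for c in sorted_coorbit(v, t)]
    return [inner(row, beta) for row in alpha]


# Example with M = n = 4 group elements, N = 2 templates, and D = 3 outputs.
v = [1.0, 2.0, 3.0, 4.0]
templates = [[1.0, 0.0, 0.0, 0.0], [1.0, -1.0, 0.5, 0.0]]
alpha = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, -1, -1, -1, -1],
]
embedding = sorting_embedding(v, templates, alpha)
embedding_of_shifted = sorting_embedding(cyclic_shift(v, 1), templates, alpha)
```

Shifting `v` permutes each coorbit without changing its entries, so sorting removes the dependence on the shift and the two embeddings coincide.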
A Combinatorial Approach to Robust PCA
Kong, Weihao, Qiao, Mingda, Sen, Rajat
We study the problem of recovering Gaussian data under adversarial corruptions when the noises are low-rank and the corruptions are on the coordinate level. Concretely, we assume that the Gaussian noises lie in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall into the control of an adversary. This setting models the scenario of learning from high-dimensional yet structured data that are transmitted through a highly noisy channel, so that the data points are unlikely to be entirely clean. Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation. At the core of our proof is a new analysis of the well-known Basis Pursuit (BP) method for recovering a sparse signal, which is known to succeed under additional assumptions (e.g., incoherence or the restricted isometry property) on the underlying subspace $U$. In contrast, we present a novel approach via studying a natural combinatorial problem and show that, over the randomness in the support of the sparse signal, a high-probability error bound is possible even if the subspace $U$ is arbitrary.